[Feature] Hierarchical reduction and warp reduction intrinsics support #1762
LeiWang1999 merged 13 commits into tile-ai:main
Conversation
📝 Walkthrough

Replaces Hopper-specific AllReduce paths with NamedBarrier for SM>=90, adds hierarchical per-warp AllReduce logic and dynamic workspace sizing when reducing_threads > 32, extends CUDA warp_reduce and barrier primitives, and generalizes reduce tests to exercise multiple ops, dtypes, and thread counts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Lane as Lane (thread lane)
    participant Warp as Warp (warp-level)
    participant Workspace as Workspace (per-warp buffer)
    participant AllReduce as AllReduce (reducer + barrier)
    Lane->>Warp: compute partial per-lane value
    Warp->>Warp: warp_reduce(value) (shuffle/intrinsics)
    Warp->>Workspace: write per-warp result (hierarchical path)
    AllReduce->>Workspace: read per-warp results
    AllReduce->>AllReduce: inter-warp reduction (with NamedBarrier sync)
    AllReduce-->>Lane: broadcast final reduced result
```
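The flow above can be sketched as a host-side simulation; this is a minimal single-threaded illustration of the hierarchical scheme (intra-warp reduce, lane 0 writing a per-warp partial into a workspace, then an inter-warp pass), not the actual TileLang device code, and all names are illustrative.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Host-side sketch of the hierarchical reduction: each "warp" of 32 lanes
// first reduces internally, "lane 0" writes its partial into a per-warp
// workspace (red_buf in the PR), then the per-warp partials are reduced.
int hierarchical_sum(const std::vector<int>& vals) {
  const std::size_t kWarp = 32;
  std::vector<int> workspace;  // one slot per warp
  for (std::size_t base = 0; base < vals.size(); base += kWarp) {
    int partial = 0;  // stands in for warp_reduce via shuffle intrinsics
    for (std::size_t i = base; i < base + kWarp && i < vals.size(); ++i)
      partial += vals[i];
    workspace.push_back(partial);  // lane 0 writes, then a barrier
  }
  // inter-warp reduction (done by the first warp after the barrier)
  return std::accumulate(workspace.begin(), workspace.end(), 0);
}
```

On device the two phases are separated by barrier syncs, which is exactly where the NamedBarrier policy in this PR comes in.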
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@testing/python/language/test_tilelang_language_reduce.py`:
- Around line 99-105: The ref_fn uses torch.uint32 in a static dtype list which
raises AttributeError on PyTorch <2.3.0; update ref_fn to conditionally include
torch.uint32 only when hasattr(torch, "uint32") (or use getattr with a fallback)
so the dtype check is built at runtime, mirroring the existing pattern used for
version-dependent dtypes like float8_e4m3fn; locate and modify the dtype
membership test in ref_fn to construct the list/set conditionally and then
perform the same res.to(A.dtype) conversion for the supported integer dtypes
(torch.uint32, torch.int32, torch.int64).
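The runtime dtype-list construction the prompt describes might look like the following sketch; `build_supported_int_dtypes` is a hypothetical helper, and the torch module is passed in as a parameter so the pattern can be shown (and tested) without importing PyTorch.

```python
def build_supported_int_dtypes(torch_mod):
    """Build the integer-dtype list at runtime: torch.uint32 only exists on
    PyTorch >= 2.3.0, so include it conditionally via getattr with a fallback,
    mirroring the version-dependent handling of dtypes like float8_e4m3fn."""
    dtypes = [torch_mod.int32, torch_mod.int64]
    uint32 = getattr(torch_mod, "uint32", None)
    if uint32 is not None:
        dtypes.append(uint32)
    return dtypes
```

ref_fn would then test membership against this runtime-built list before calling `res.to(A.dtype)`.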
🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_reduce.py (1)
7-7: Consider scoping `disable_cache()` to avoid global test side effects.
This flips a process-wide cache flag; if other tests run in the same session, they inherit the disabled cache. If that’s not intended, wrap it in a fixture/context that re-enables after this module.
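One way to scope the flag, sketched as a context manager; the `state` dict here is a hypothetical stand-in for whatever process-wide flag `disable_cache()` actually flips, so this shows only the save/restore pattern.

```python
from contextlib import contextmanager


@contextmanager
def cache_disabled(state):
    """Disable a process-wide cache flag for the duration of a test module,
    restoring the previous value afterwards, even if the body raises."""
    prev = state.get("cache_enabled", True)
    state["cache_enabled"] = False
    try:
        yield
    finally:
        state["cache_enabled"] = prev
```

Wrapped in a pytest fixture with `scope="module"`, this keeps the disabled cache from leaking into other test modules in the same session.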
Maybe we also need performance regression tests. Where can I trigger them? @LeiWang1999
@regression-perf
Thanks for your contribution! I've further investigated this problem this weekend and have some suggestions for the code:
@tzj-fxz Would you mind taking a look?
Thank you for the advice. I will further implement these features. |
redux PTX support
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/tl_templates/cuda/reduce.h`:
- Around line 377-381: The fast-path incorrectly casts 64-bit integrals to
int32_t and uses run_reduce_sync, truncating results; update the compile-time
branch in the template so the run_reduce_sync path is only taken when T is
exactly int32_t or uint32_t (remove the branch that treats other
std::is_integral_v<T> types by casting), and let 64-bit types fall through to
the shfl_xor_sync-based fallback that correctly handles 64-bit operands; adjust
the condition around run_reduce_sync (the constexpr if that currently checks
std::is_same_v<T, int32_t> || std::is_same_v<T, uint32_t> and the following
std::is_integral_v<T> branch) accordingly so no cast-to-int32_t is performed for
64-bit integrals.
- Around line 344-358: The sm100 fast-path incorrectly casts floats and calls
__reduce_max_sync/__reduce_min_sync which only accept 32-bit integer operands;
modify the branch in reduce.h so the __reduce_*_sync calls are only compiled for
32-bit integer types (e.g., when std::is_same_v<T,int> or
std::is_same_v<T,unsigned int> / std::is_integral and sizeof(T)==4), and remove
the float/half/bfloat casting there; for all other types (float, half_t,
bfloat16_t, etc.) fall through to the existing non-sm100 handling. Ensure you
static_cast the value to int/unsigned before passing to
__reduce_max_sync/__reduce_min_sync and keep the ReduceOp/MaxOp/MinOp constexpr
checks to select the right intrinsic.
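The exact-32-bit guard from the first item above can be written as a compile-time trait; this is a sketch under the assumption that everything else falls through to the `shfl_xor_sync`-based path, and the trait name is illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Only exactly int32_t/uint32_t may take the __reduce_*_sync fast path.
// 64-bit integrals must NOT match: casting them to int32_t truncates the
// operand, so they fall through to the shuffle-based fallback instead.
template <typename T>
constexpr bool takes_redux_fast_path =
    std::is_same_v<T, int32_t> || std::is_same_v<T, uint32_t>;
```

The key point is `is_same_v` rather than `is_integral_v`: the latter would also admit `int64_t`/`uint64_t` and reintroduce the truncation bug.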
As far as I know, it's indeed complex.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/tl_templates/cuda/reduce.h`:
- Around line 344-366: The SM100 asm fast-path currently casts any T to float
and uses redux.sync.*.f32, which corrupts large integer ordering; change the
compile-time guard inside the SM100 block to only take this fast-path for
floating types (float, half_t, bfloat16_t) — e.g. wrap the asm branches with an
if constexpr that checks std::is_floating_point_v<T> || std::is_same_v<T,
half_t> || std::is_same_v<T, bfloat16_t> so integral types fall through to the
SM80+ __reduce_max_sync/__reduce_min_sync implementations; keep the existing
value_cast and asm/result handling for the floating case and do not alter the
fallback path for non-floating T.
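Why the float cast corrupts integer ordering is easy to demonstrate on the host: float has a 24-bit significand, so distinct integers above 2^24 can collapse to the same value after the cast, and a max/min over the casted values then returns the wrong extremum. A small self-contained check (helper name illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Two distinct integers that become indistinguishable after a cast to
// float: this is why redux.sync.*.f32 must not be used for integral T.
bool floats_collapse(int64_t a, int64_t b) {
  return a != b && static_cast<float>(a) == static_cast<float>(b);
}
```

With `a = 2^24` and `b = 2^24 + 1`, both round to the same float, so a float-typed max over them cannot tell which operand was larger.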
Fixed in the latest commit. :)
LeiWang1999 left a comment:

It would be better to have some benchmark results.
@regression-perf
Performance Regression Test Report

Triggered by: @tzj-fxz
Threads=128. One block.
The regression performance test also shows that
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/tl_templates/cuda/reduce.h`:
- Around line 344-365: The fast-path in the CUDA reduce code leaves value_cast
as 0.0f for T=float (and mishandles double), so the inline asm reduces zeros;
update the branches in the block inside reduce.h (the section that sets
value_cast and emits redux.sync) to handle float explicitly by assigning
value_cast = value when std::is_same_v<T, float>, keep the existing assignments
for half_t and bfloat16_t, and ensure the fast-path is enabled only for the
three safe types (float, half_t, bfloat16_t) — do NOT include double — before
emitting the MaxOp/MinOp asm so the asm uses the real value_cast.
- Around line 368-382: The template lambda run_reduce_sync uses C++20-only
syntax and must be replaced for C++17: implement a regular templated helper
function (or a small function object) named e.g. run_reduce_sync_impl templated
on typename T_cast that contains the same if constexpr branches for ReduceOp vs
SumOp/MaxOp/MinOp/BitAndOp/BitOrOp/BitXorOp and calls the corresponding
__reduce_*_sync(mask, val), then call that helper from the existing site instead
of the lambda; ensure the helper has the same capture visibility (pass mask in
as a parameter or make it accessible) and is declared in the same scope so
ReduceOp and the __reduce_*_sync calls resolve correctly.
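In C++17 the template lambda becomes an ordinary function template; a minimal host-side sketch where op tags stand in for the reducer functors and plain arithmetic stands in for the `__reduce_*_sync` intrinsics (all names illustrative):

```cpp
#include <cassert>
#include <type_traits>

// Op tags mirroring the reducer functors in reduce.h (illustrative).
struct SumOp {};
struct MaxOp {};
struct MinOp {};

// C++17 replacement for a C++20 template lambda: an ordinary function
// template dispatching on the op tag with if constexpr. On device, each
// branch would call the matching __reduce_*_sync(mask, val) intrinsic,
// with mask passed in as a parameter rather than captured.
template <typename Op, typename T>
T run_reduce_sync_impl(T a, T b) {
  if constexpr (std::is_same_v<Op, SumOp>)
    return a + b;
  else if constexpr (std::is_same_v<Op, MaxOp>)
    return a > b ? a : b;
  else
    return a < b ? a : b;  // MinOp
}
```

Since the helper is declared in the same scope the lambda was, the `ReduceOp` tag and intrinsic names resolve exactly as before; only the C++20-specific `[]<typename T_cast>(...)` syntax goes away.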
🧹 Nitpick comments (1)
src/tl_templates/cuda/reduce.h (1)
130-189: `run_hopper` duplicates `run` except for barrier calls; consider unifying. The hierarchical reduction logic (warp reduce → shared-memory exchange → butterfly) is copy-pasted between `run` and `run_hopper`, differing only in barrier synchronization (`__syncthreads()` vs `bar.sync` asm). A barrier-policy template parameter or a shared helper parameterized on a sync functor would eliminate this ~60-line duplication and reduce maintenance risk when the reduction logic is updated.
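The suggested unification can be sketched with a barrier-policy template parameter; here the policies are host-side no-ops so the structure can be shown and run, while on device they would wrap `__syncthreads()` and a named-barrier `bar.sync` respectively (all names illustrative).

```cpp
#include <cassert>

// Two interchangeable barrier policies. Device code would make these
// __syncthreads() and a bar.sync asm on a named barrier; here they are
// no-op stand-ins so the dispatch structure is testable on the host.
struct SyncThreadsBarrier {
  template <int Phase> static void sync() {}
};
template <int NumThreads> struct NamedBarrier {
  template <int Phase> static void sync() {}
};

// One run() covering both the generic and Hopper paths: the duplicated
// bodies collapse into a single function parameterized on Barrier.
template <typename Barrier>
int run_reduce(int partial) {
  Barrier::template sync<1>();
  // ... hierarchical reduction between barrier phases would go here ...
  Barrier::template sync<2>();
  return partial;
}
```

Codegen then only has to choose the policy (`SyncThreadsBarrier` vs `NamedBarrier<all_threads>` for SM >= 90) instead of choosing between two method bodies.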
Could you add some more tests? e.g.:
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@testing/python/language/test_tilelang_language_reduce.py`:
- Around line 46-61: The test's dispatch on op (variable op) lacks a final else
branch so unsupported op values leave B_local uninitialized and produce silent
garbage; update the branch in the block that calls T.reduce_sum / T.reduce_max /
T.reduce_min / T.reduce_abssum / T.reduce_absmax / T.reduce_bitand /
T.reduce_bitor / T.reduce_bitxor to include a final else that raises an explicit
exception (e.g., raise ValueError or AssertionError) including the invalid op
string so the test fails loudly and identifies the unsupported operation.
- Around line 98-111: ref_fn currently lacks branches for the bitwise ops
(bitand, bitor, bitxor) so res can be unassigned; add branches inside ref_fn to
handle "bitand", "bitor", and "bitxor" by performing a column-wise reduction
over dim=1 (looping over columns and applying the appropriate
torch.bitwise_and/bitwise_or/bitwise_xor) and for "bitand" initialize the
accumulator with an all-ones tensor of the same dtype/shape as a row (use
~torch.zeros_like(row) or equivalent) so types like torch.uint32/int32/int64
behave correctly; keep the existing dtype-preservation logic (the check using
A.dtype in [...]) and return the reduced tensor cast back when needed.
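The loud-failure dispatch described in the first item above can be sketched in plain Python; the reducers here are host-side stand-ins for the `T.reduce_*` calls, and the helper name is hypothetical.

```python
def dispatch_reduce(op, values):
    """Dispatch on the reduce op name, raising for unknown ops instead of
    silently leaving the output uninitialized."""
    reducers = {
        "sum": sum,
        "max": max,
        "min": min,
        "abssum": lambda vs: sum(abs(v) for v in vs),
        "absmax": lambda vs: max(abs(v) for v in vs),
    }
    if op not in reducers:
        raise ValueError(f"unsupported reduce op: {op!r}")
    return reducers[op](values)
```

The final `else: raise` is the whole point: an invalid op string fails the test immediately with a message naming the op, rather than producing garbage output that may happen to pass a tolerance check.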
🧹 Nitpick comments (2)
testing/python/language/test_tilelang_language_reduce.py (2)
133-138: Bitwise reduce ops (`bitand`, `bitor`, `bitxor`) are dispatched in the kernel but never tested. The kernel builder supports these ops (lines 56-61), yet `test_reduce_other_op` only covers `["max", "min", "abssum", "absmax"]`. Adding at least one test per bitwise op (on integer dtypes) would catch regressions in the new warp-reduce paths. This also aligns with the reviewer request for broader operation coverage.
88-94: `mode="ss"` artificially restricted to `op="sum"` despite existing shared builders for other ops. Shared-reduce builders already exist for `max`, `min`, `abssum`, and `absmax` (lines 71-84). The `run_reduce` function could dispatch to the right one instead of hard-coding `reduce_sum_ss`. This would allow the same `run_reduce` interface to cover shared-mode tests uniformly.

Sketch:

```diff
 elif mode == "ss":
-    assert op == "sum", f"shared reduce only supports sum, got {op}"
-    program = reduce_sum_ss(M, N, dtype)
+    ss_builders = {
+        "sum": reduce_sum_ss,
+        "max": reduce_max_ss,
+        "min": reduce_min_ss,
+        "abssum": reduce_abssum_ss,
+        "absmax": reduce_absmax_ss,
+    }
+    if op not in ss_builders:
+        raise NotImplementedError(f"shared reduce not implemented for op={op}")
+    program = ss_builders[op](M, N, dtype)
```
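For the missing bitwise reference branches discussed above, here is a NumPy sketch of the column-wise reduction; note the `bitand` accumulator starts from all-ones so the identity is correct for any integer dtype. The helper name and the use of NumPy in place of torch are illustrative only.

```python
import numpy as np


def ref_bitwise_reduce(op, A):
    """Column-wise bitwise reduction over dim=1, mirroring what ref_fn
    needs for bitand/bitor/bitxor. A is a 2-D integer array."""
    if op == "bitand":
        acc = ~np.zeros(A.shape[0], dtype=A.dtype)  # all-ones identity
        fn = np.bitwise_and
    elif op == "bitor":
        acc = np.zeros(A.shape[0], dtype=A.dtype)
        fn = np.bitwise_or
    elif op == "bitxor":
        acc = np.zeros(A.shape[0], dtype=A.dtype)
        fn = np.bitwise_xor
    else:
        raise ValueError(f"unsupported bitwise op: {op!r}")
    for j in range(A.shape[1]):
        acc = fn(acc, A[:, j])
    return acc
```

In the torch version the same structure applies with `torch.bitwise_and/bitwise_or/bitwise_xor`, followed by the existing dtype-preservation cast back to `A.dtype`.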
I have no further concerns. Shall we test and merge it? @LeiWang1999
I don't have a ROCm machine on hand to reproduce the error. The Hopper machine passes this test. It seems weird because the test function
Replace the duplicated run()/run_hopper() methods in AllReduce with a single run() that accepts a Barrier policy template parameter (SyncThreadsBarrier or NamedBarrier<N>). Extract the shared inter-warp reduction logic into a warp_inter_reduce helper, and split the dispatch into private hierarchical_reduce/butterfly_reduce methods. Update codegen in reduce.cc and finalize_reducer.cc to emit NamedBarrier<all_threads> for SM >= 90 targets instead of the old all_threads + run_hopper pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/op/reduce.cc`:
- Around line 392-408: The hierarchical-workspace sizing is wrong when
reducing_threads < all_threads: is_hierarchical currently makes workspace_size =
reducing_threads / 32, but threads use global_warp_id = (threadIdx.x -
thread_offset) / 32 which can index up to (all_threads-1)/32, causing
out-of-bounds writes in hierarchical_reduce; fix by sizing the workspace to
cover all warps in the block when is_hierarchical (use the block thread extent
from T.thread_bounds->extent, e.g. workspace_size =
static_cast<int>(*as_const_int(T.thread_bounds->extent) + 31) / 32 or equivalent
integer division rounding up), and keep using T.AddWorkspace and
thread_reduce_args as before (symbols to edit: is_hierarchical,
reducing_threads, T.thread_bounds->extent, T.AddWorkspace, hierarchical_reduce,
global_warp_id, thread_offset).
In `@src/tl_templates/cuda/reduce.h`:
- Around line 65-78: The comment notes that hierarchical_reduce now uses barrier
phases 1, 2 and 3, so barrier IDs 1–3 are reserved for internal use; update any
user-facing barrier documentation and comments that currently state "user
barriers start from 3" to instead state that user-defined barrier IDs must start
from 4, and add a short note near the barrier policy declarations
(SyncThreadsBarrier and NamedBarrier) and in the hierarchical_reduce
documentation mentioning that hierarchical_reduce occupies phases 1–3 to avoid
future sync conflicts.
- Around line 140-164: hierarchical_reduce can write past red_buf when the
caller uses fewer "reducing_threads" than the full block (global_warp_id may
exceed reducing_threads/32); fix by adding bounds checks or by having the caller
allocate red_buf sized to total global warps: ensure before writing
red_buf[global_warp_id] and before returning/reading red_buf[group_id *
num_warps_per_group] that global_warp_id and group_id*... are within the
allocated workspace (or clamp/skips writes for warps >= reducing_warps), and
verify warp_inter_reduce is only invoked for groups that have valid entries;
update the caller that allocates red_buf (and any code that computes
reducing_threads) to size it to ceil(reducing_threads/32) or to total global
warps consistently.
🧹 Nitpick comments (1)
src/op/finalize_reducer.cc (1)
113-117: Minor inconsistency: workspace allocation threshold is `>= 32` here vs `> 32` in `reduce.cc`. In `reduce.cc` (line 401), workspace is only allocated when `reducing_threads > 32`, while here it uses `>= 32`. When `reducing_threads == 32`, `AllReduce::run` dispatches to `warp_reduce`, which doesn't use `red_buf`, so the allocation is harmless but wastes shared memory. Consider aligning the threshold to `> 32` for consistency.

♻️ Suggested fix

```diff
-  if (reducing_threads >= 32) {
+  if (reducing_threads > 32) {
```
src/op/reduce.cc (outdated)

```diff
   bool is_hierarchical = [&]() {
     if (reducing_threads <= 32)
       return false;
     if (reducing_threads % 32 != 0)
       return false;
     if (*scale != 1)
       return false;
     return true;
   }();
   if (reducing_threads > 32) {
-    PrimExpr workspace = T.AddWorkspace(
-        *as_const_int(T.thread_bounds->extent), clear_buffer->dtype);
+    int workspace_size =
+        is_hierarchical
+            ? reducing_threads / 32
+            : static_cast<int>(*as_const_int(T.thread_bounds->extent));
+    PrimExpr workspace =
+        T.AddWorkspace(workspace_size, clear_buffer->dtype);
     thread_reduce_args.push_back(workspace);
```
Critical: workspace undersized for hierarchical reduction when reducing_threads < all_threads.
In hierarchical_reduce (reduce.h, lines 146-153), global_warp_id = (threadIdx.x - thread_offset) / 32 ranges from 0 to (all_threads - 1) / 32. When reducing_threads < all_threads (e.g., the reduce axis only covers a subset of threads), the workspace is allocated as reducing_threads / 32 but threads in higher groups write to red_buf[global_warp_id] at indices beyond that size — an out-of-bounds shared memory write.
Example: all_threads=256, reducing_threads=64, scale=1 → workspace_size=2, but global_warp_id can reach 7.
The fix is to size the hierarchical workspace to cover all warps in the block:
🐛 Proposed fix

```diff
 int workspace_size =
     is_hierarchical
-        ? reducing_threads / 32
+        ? static_cast<int>(*as_const_int(T.thread_bounds->extent)) / 32
         : static_cast<int>(*as_const_int(T.thread_bounds->extent));
```

🤖 Prompt for AI Agents
In `@src/op/reduce.cc` around lines 392 - 408, The hierarchical-workspace sizing
is wrong when reducing_threads < all_threads: is_hierarchical currently makes
workspace_size = reducing_threads / 32, but threads use global_warp_id =
(threadIdx.x - thread_offset) / 32 which can index up to (all_threads-1)/32,
causing out-of-bounds writes in hierarchical_reduce; fix by sizing the workspace
to cover all warps in the block when is_hierarchical (use the block thread
extent from T.thread_bounds->extent, e.g. workspace_size =
static_cast<int>(*as_const_int(T.thread_bounds->extent) + 31) / 32 or equivalent
integer division rounding up), and keep using T.AddWorkspace and
thread_reduce_args as before (symbols to edit: is_hierarchical,
reducing_threads, T.thread_bounds->extent, T.AddWorkspace, hierarchical_reduce,
global_warp_id, thread_offset).
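The sizing the prompt suggests is simply a ceiling division over the full block extent; a minimal host-side sketch of the computation (function name illustrative):

```cpp
#include <cassert>

// Workspace slots for hierarchical reduction must cover every warp in the
// block, since red_buf is indexed by global_warp_id = (threadIdx.x -
// thread_offset) / 32, not just the reducing warps: ceil(all_threads / 32).
constexpr int hierarchical_workspace_size(int all_threads) {
  return (all_threads + 31) / 32;
}
```

With `all_threads=256, reducing_threads=64` (the example from the review), this yields 8 slots, covering `global_warp_id` up to 7 instead of the undersized 2.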
src/tl_templates/cuda/reduce.h (outdated)

```diff
 private:
   template <typename T>
-  static TL_DEVICE T run_hopper(T x, T *red_buf = nullptr) {
+  static TL_DEVICE T hierarchical_reduce(T x, T *red_buf) {
     x = warp_reduce<T>(x, Reducer());

     constexpr int num_warps_per_group = threads / 32;
     const int global_warp_id = (threadIdx.x - thread_offset) / 32;
     const int group_id = (threadIdx.x - thread_offset) / threads;
     const int warp_id_in_group = global_warp_id % num_warps_per_group;
     const int lane_id = threadIdx.x % 32;

     Barrier::template sync<1>();
     if (lane_id == 0) {
       red_buf[global_warp_id] = x;
     }
     Barrier::template sync<2>();

     if (warp_id_in_group == 0) {
       const int group_base_warp = group_id * num_warps_per_group;
       warp_inter_reduce<Reducer, num_warps_per_group>(red_buf, group_base_warp,
                                                       lane_id);
     }
     Barrier::template sync<3>();
     return red_buf[group_id * num_warps_per_group];
   }
```
Hierarchical reduce: verify workspace sizing in the caller.
The implementation itself is correct — global_warp_id correctly indexes across all warps in the block, and the group-based structure properly partitions the reduction. However, see the comment on src/op/reduce.cc lines 392-408 regarding workspace sizing: when reducing_threads < all_threads, global_warp_id can exceed reducing_threads / 32, causing an OOB write into red_buf.
🤖 Prompt for AI Agents
In `@src/tl_templates/cuda/reduce.h` around lines 140 - 164, hierarchical_reduce
can write past red_buf when the caller uses fewer "reducing_threads" than the
full block (global_warp_id may exceed reducing_threads/32); fix by adding bounds
checks or by having the caller allocate red_buf sized to total global warps:
ensure before writing red_buf[global_warp_id] and before returning/reading
red_buf[group_id * num_warps_per_group] that global_warp_id and group_id*... are
within the allocated workspace (or clamp/skips writes for warps >=
reducing_warps), and verify warp_inter_reduce is only invoked for groups that
have valid entries; update the caller that allocates red_buf (and any code that
computes reducing_threads) to size it to ceil(reducing_threads/32) or to total
global warps consistently.
This update introduces a new case in the warp_reduce function to cast non-float types to float using static_cast. This enhancement improves type flexibility and ensures compatibility with a broader range of input types during reduction operations.
@regression-perf

Performance Regression Test Report

Triggered by: @LeiWang1999
@regression-perf

Performance Regression Test Report

Triggered by: @tzj-fxz
For #1761
Adds `redux.sync` PTX templates to support faster reduction on (u)int32 with `__CUDA_ARCH__ >= 800`.

Summary by CodeRabbit
Performance
Bug Fixes
Tests